Parallel machine architecture and compiler design facilities
The objective is to provide an integrated simulation environment for studying and evaluating various issues in designing parallel systems, including machine architectures, parallelizing compiler techniques, and parallel algorithms. The status of the Delta project, whose objective is to provide a facility for rapid prototyping of parallelizing compilers that can target different machine architectures, is summarized. Included are surveys of the program manipulation tools developed, the environmental software supporting Delta, and the compiler research projects in which Delta has played a role.
Fast speculative address generation and way caching for reducing L1 data cache energy
L1 data caches in high-performance processors continue to grow in set associativity. Higher associativity can significantly increase the cache energy consumption. Cache access latency can be affected as well, leading to an increase in overall energy consumption due to increased execution time. At the same time, the static energy consumption of the cache increases significantly with each new process generation. This paper proposes a new approach to reduce the overall L1 cache energy consumption using a combination of way caching and fast, speculative address generation. A 16-entry way cache storing a 3-bit way number for recently accessed L1 data cache lines is shown to be sufficient to significantly reduce both static and dynamic energy consumption of the L1 cache. Fast speculative address generation helps to hide the way cache access latency and is highly accurate. The L1 cache energy-delay product is reduced by 10% compared to using the way cache alone and by 37% compared to the multiple-MRU technique. Peer reviewed. Postprint (published version).
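The way-caching idea can be illustrated with a minimal Python sketch. It is not the paper's design: the LRU replacement policy, the class and function names, and the simplified probe count are all assumptions made for illustration. Only the 16 entries and the 3-bit way number (enough for 8 ways) come from the abstract.

```python
from collections import OrderedDict


class WayCache:
    """Tiny table mapping a cache-line address to the way that holds it.

    LRU replacement is an assumption; the abstract does not state the policy.
    """

    def __init__(self, entries=16, ways=8):
        self.entries = entries
        self.ways = ways              # 8 ways fit in a 3-bit way number
        self.table = OrderedDict()

    def lookup(self, line_addr):
        """Return the remembered way, or None if the line is not tracked."""
        way = self.table.get(line_addr)
        if way is not None:
            self.table.move_to_end(line_addr)   # mark as most recently used
        return way

    def record(self, line_addr, way):
        """Remember which way served this line, evicting the LRU entry if full."""
        assert 0 <= way < self.ways
        self.table[line_addr] = way
        self.table.move_to_end(line_addr)
        if len(self.table) > self.entries:
            self.table.popitem(last=False)


def ways_probed(way_cache, line_addr, actual_way):
    """Count data ways read for one access.

    Simplified model: 1 way on a correct way-cache hit, all ways otherwise.
    """
    hit = way_cache.lookup(line_addr)
    way_cache.record(line_addr, actual_way)
    return 1 if hit == actual_way else way_cache.ways
```

In this model, a repeated access to the same line probes one way instead of eight, which is where the dynamic energy saving would come from.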
Power-aware load balancing of large scale MPI applications
Power consumption is a very important issue for the HPC community, both at the level of a single application and at the level of the whole workload. Load imbalance in an MPI application can be exploited to save CPU energy without penalizing execution time. An application is load imbalanced when some nodes are assigned more computation than others. The nodes with less computation can be run at a lower frequency, since otherwise they would sit blocked in MPI calls waiting for the nodes with more computation. A technique that can be used to reduce the speed is Dynamic Voltage Frequency Scaling (DVFS). Dynamic power dissipation is proportional to the product of the frequency and the square of the supply voltage, while static power is proportional to the supply voltage. Thus decreasing voltage and/or frequency results in power reduction. Furthermore, over-clocking can be applied in some CPUs to reduce overall execution time. This paper investigates the impact of using different gear sets, over-clocking, and application and platform properties to reduce CPU power. A new algorithm applying DVFS and CPU over-clocking is proposed that reduces execution time while achieving power savings comparable to prior work. The results show that it is possible to save up to 60% of CPU energy in applications with high load imbalance. Our results show that six gear sets achieve, on average, results close to the continuous frequency set that has been used as a baseline. Peer reviewed. Postprint (published version).
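The reasoning above (slow down lightly loaded nodes, with dynamic power proportional to f·V²) can be sketched in Python as follows. This is an illustrative model, not the algorithm proposed in the paper. The function names and the gear values in the usage note are hypothetical. It also assumes that compute time scales as 1/frequency, and it ignores static power and over-clocking.

```python
def pick_gears(compute_times, gears):
    """For each node, pick the slowest (frequency, voltage) gear that still
    finishes no later than the most loaded node running at the top gear.

    Assumes compute time scales as 1/frequency (ignores memory-bound phases).
    """
    gears = sorted(gears)                  # ascending frequency
    f_max = gears[-1][0]
    deadline = max(compute_times)          # critical node at the top gear
    chosen = []
    for t in compute_times:
        for f, v in gears:
            if t * f_max / f <= deadline + 1e-12:
                chosen.append((f, v))
                break
    return chosen


def relative_dynamic_energy(compute_times, chosen, top_gear):
    """Dynamic CPU energy of the compute phases relative to running every
    node at the top gear, using P_dyn proportional to f * V**2.
    """
    f_max, v_max = top_gear
    base = sum(t * f_max * v_max ** 2 for t in compute_times)
    scaled = sum((t * f_max / f) * f * v ** 2
                 for t, (f, v) in zip(compute_times, chosen))
    return scaled / base
```

For example, with the hypothetical gears `[(1.0, 0.9), (1.5, 1.0), (2.0, 1.2)]` and per-node compute times `[10.0, 5.0, 7.5]`, the most loaded node keeps the top gear while the other two drop to lower gears without extending the overall execution time.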
Direct instruction wakeup for out-of-order processors
Instruction queues consume a significant amount of power in high-performance processors, primarily due to instruction wakeup logic access to the queue structures. The wakeup logic delay is also a critical timing parameter. This paper proposes a new queue organization using a small number of successor pointers plus a small number of dynamically allocated full successor bit vectors for cases with a larger number of successors. The details of the new organization are described and it is shown to achieve the performance of CAM-based or full dependency matrix organizations using just one pointer per instruction plus eight full bit vectors. Only two full bit vectors are needed when two successor pointers are stored per instruction. Finally, a design and pre-layout of all critical structures in 70 nm technology was performed for the proposed organization as well as for a CAM-based baseline. The new design is shown to use one half to one fifth of the baseline instruction queue power, depending on queue size. It is also shown to use significantly less power than the full dependency matrix based design. Peer reviewed. Postprint (published version).
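A rough Python sketch of the pointer-plus-bit-vector bookkeeping described above is given below. The class name, method names, and the policy of reporting failure when no bit vector is free are assumptions; the abstract does not say how the hardware handles that case.

```python
class WakeupQueue:
    """One successor pointer per entry plus a small pool of full bit vectors."""

    def __init__(self, size, num_vectors=8):
        self.size = size
        self.pointer = [None] * size       # single successor pointer per entry
        self.vector_of = [None] * size     # index of an allocated bit vector
        self.vectors = [0] * num_vectors   # full successor bit vectors
        self.free = list(range(num_vectors))

    def add_successor(self, producer, consumer):
        """Register a dependence; return False if no bit vector is available."""
        v = self.vector_of[producer]
        if v is None:
            if self.pointer[producer] is None:
                self.pointer[producer] = consumer   # first successor: pointer
                return True
            if not self.free:
                return False                        # assumed: caller must stall
            v = self.free.pop()
            self.vector_of[producer] = v
            # Move the existing pointer into the newly allocated bit vector.
            self.vectors[v] = 1 << self.pointer[producer]
            self.pointer[producer] = None
        self.vectors[v] |= 1 << consumer
        return True

    def wakeup(self, producer):
        """Return the entries to wake when `producer` completes; free its state."""
        v = self.vector_of[producer]
        if v is not None:
            bits = self.vectors[v]
            woken = [i for i in range(self.size) if (bits >> i) & 1]
            self.vectors[v] = 0
            self.free.append(v)
            self.vector_of[producer] = None
        else:
            p = self.pointer[producer]
            woken = [] if p is None else [p]
            self.pointer[producer] = None
        return woken
```

A producer with a single consumer never touches the bit-vector pool, which matches the abstract's point that only a few full vectors are needed.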
A novel architecture for large windows processors
Several processor architectures with large instruction windows have been proposed. They improve performance by maintaining hundreds of instructions in flight to increase the level of instruction-level parallelism (ILP). Such architectures replace a re-order buffer (ROB) with a check-pointing mechanism and an out-of-order release of the processor resources. Check-pointing, however, leads to imprecise state recovery on mispredicted branches and exceptions, and to frequent re-execution of current-path instructions during the state recovery. It also requires large register files, complicating renaming, allocation and release of physical registers. This technical report proposes a new processor architecture that uses neither a traditional ROB nor check-pointing, avoids the above-mentioned problems, and has a fast, distributed state recovery mechanism. Its novel register management architecture allows implementation of large register files with simpler and more scalable register renaming and commit. It is also key to the precise recovery mechanism. Postprint (published version).
High performance annotation-aware JVM for Java cards
Early applications of smart cards have focused on the area of personal security. Recently, there has been an increasing demand for networked, multi-application cards. In this new scenario, enhanced application-specific on-card Java applets and complex cryptographic services are executed through the smart card Java Virtual Machine (JVM). In order to support such computation-intensive applications, contemporary smart cards are designed with built-in microprocessors and memory. As smart cards are highly area-constrained environments with memory, CPU and peripherals competing for a very small die space, the VM execution engine of choice is often a small, slow interpreter. In addition, support for multiple applications and cryptographic services demands a high-performance VM execution engine. The above necessitates the optimization of the JVM for Java Cards.
CAMFAS: A Compiler Approach to Mitigate Fault Attacks via Enhanced SIMDization
The trend of supporting wide vector units in general purpose microprocessors suggests opportunities for developing a new and elegant compilation approach to mitigate the impact of faults on cryptographic implementations, which we present in this work.
We propose a compilation flow, CAMFAS, to automatically and selectively introduce vectorization in a cryptographic library, translating a vanilla library into a library with vectorized code that is resistant to glitches. Unlike in traditional vectorization, the proposed compilation flow uses the extent of the vectors to introduce spatial redundancy in the intermediate computations. By doing so, without significantly increasing code size and execution time, the compilation flow provides sufficient redundancy in the data to detect errors in the intermediate values of the computation.
Experimental results show that the proposed approach generates, on average, only 26% more dynamic instructions over a series of asymmetric cryptographic algorithms in the Libgcrypt library.
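The notion of spatial redundancy across vector lanes can be illustrated with a scalar Python sketch. This is only an analogy: CAMFAS works at the compiler level and emits SIMD code, whereas the function below, its name, and its `fault` hook are hypothetical and merely imitate the idea of computing the same intermediate value in several lanes and comparing them.

```python
def redundant_mulmod(a, b, m, lanes=2, fault=None):
    """Compute (a * b) % m in `lanes` identical lanes and compare the results.

    `fault` is a hypothetical hook (lane_index, xor_mask) that flips bits in
    one lane's intermediate product to imitate a glitch.
    """
    products = [a * b for _ in range(lanes)]    # same computation in every lane
    if fault is not None:
        lane, mask = fault
        products[lane] ^= mask                   # inject the simulated glitch
    results = [p % m for p in products]
    if any(r != results[0] for r in results):
        raise RuntimeError("fault detected: lanes disagree")
    return results[0]
```

A fault that corrupts a single lane makes the lanes disagree and is reported, while a fault-free run returns the ordinary result.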
Improving interpreted execution performance with Java bytecode SuperOperators
This paper exploits the concept of optimizing the interpreted execution of Java programs with SuperOperators (SOs). SOs are groups of bytecode operations used to produce interpreter engines with specialized instructions. The present work makes three distinct contributions to this topic. Firstly, we show that fewer than 20 SOs formed by basic blocks cover more than 50% of all bytecodes executed by an application and are enough to yield the bulk of performance improvement when optimizing interpreters with SOs. We analyze SOs formed by the most frequently executed program basic blocks and SOs formed by special sub-patterns of Java bytecode operations that compose the basic blocks. Such sub-patterns are extensions of PicoJava's stack operation folding (OF) patterns. Unlike SOs formed by basic blocks, we find OF patterns repeat across a wide range of applications. Secondly, we compare techniques for optimizing interpreters with SOs. We show that the number of stack accesses and stack pointer updates, implicit in the bytecode semantics, is more limiting to the interpreter performance than the bytecode dispatch overhead. Our findings suggest that an interpreter that fully optimizes the top SOs formed by basic blocks, reducing both sources of overhead, yields up to fourfold performance improvement compared to previous techniques. Finally, we assess the efficiency of a software implementation of the stack operation folding mechanism. We design statically customized interpreter versions that use a limited number of non-patented Java bytecode opcodes to represent SOs formed by OF patterns valuable across applications. We also propose a dynamic scheme that is more flexible in customizing the interpreter for a particular application. Both approaches use annotation attributes in the class files marking occurrences of the most valuable SOs, dispensing with the expensive pattern search and classification at runtime.
Our statically customized interpreter versions, deploying a limited subset of SOs, and our dynamically customized version improve the performance of SPEC JVM98 and Java Grande Forum benchmarks by 7% to 39%.
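As a toy illustration of a SuperOperator, the Python sketch below fuses a load/load/add/store sequence into one specialized instruction, removing both the extra dispatches and the intermediate stack traffic. The opcode names and the tuple-based bytecode are invented for this example and are not Java bytecode or the paper's implementation.

```python
def run(program, env):
    """Interpret a tiny stack bytecode; return the number of dispatches."""
    stack, dispatches = [], 0
    for op, *args in program:
        dispatches += 1
        if op == "load":
            stack.append(env[args[0]])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "store":
            env[args[0]] = stack.pop()
        elif op == "so_load_load_add_store":
            # Fused superoperator: no intermediate stack pushes or pops.
            x, y, dst = args
            env[dst] = env[x] + env[y]
        else:
            raise ValueError(f"unknown opcode {op}")
    return dispatches


def fuse(program):
    """Replace each load/load/add/store sequence with the fused superoperator."""
    out, i = [], 0
    while i < len(program):
        window = program[i:i + 4]
        if [ins[0] for ins in window] == ["load", "load", "add", "store"]:
            out.append(("so_load_load_add_store",
                        window[0][1], window[1][1], window[3][1]))
            i += 4
        else:
            out.append(program[i])
            i += 1
    return out
```

Running the program `[("load", "a"), ("load", "b"), ("add",), ("store", "c")]` before and after `fuse` gives the same value for `c`, with one dispatch instead of four.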
Reducing leakage power in peripheral circuits of L2 caches
Leakage power has grown significantly and is a major challenge in microprocessor design. Leakage is the dominant power component in second-level (L2) caches. This paper presents two architectural techniques to utilize leakage reduction circuits in L2 caches. They primarily target the leakage in the peripheral circuitry of an L2 cache and as such have to be able to cope with longer delays. One technique exploits the fact that processor activity decreases significantly after an L2 cache miss occurs and saves power during L2 miss service time. Two algorithms, a static one and an adaptive one, are proposed for deciding when to apply this leakage reduction technique. Another technique attempts to keep the peripheral circuits in a lower-power state most of the time. The results for SPEC2K benchmarks show that the first technique can achieve an 18 to 22% reduction in L2 power consumption, on average (and up to 63%), depending on the decision algorithm. The second technique can save 25%, on average (and up to 80%). This comes with a negligible 1 to 2% performance impact, on average, depending on the technique used.